Skip to content

Stuck-key incident capture: add stdout tail + system load, document GUI-only gap (MAL-57)#907

Merged
malpern merged 5 commits into
masterfrom
claude/wizardly-kapitsa-d02efd
Jun 11, 2026
Merged

Stuck-key incident capture: add stdout tail + system load, document GUI-only gap (MAL-57)#907
malpern merged 5 commits into
masterfrom
claude/wizardly-kapitsa-d02efd

Conversation

@malpern

@malpern malpern commented Jun 11, 2026

Copy link
Copy Markdown
Owner

What

Enriches the stuck-key incident captures written by StuckKeyRecoveryService to ~/Library/Logs/KeyPath/stuck-key-incidents/, closing the gaps that made the 2026-06-10 MAL-57 captures useless or absent:

  1. Kanata stdout tail — the pqrs driver-connection evidence (connected, driver connected: false, output backend unavailable during write, dropping KEY_* Release) lives in stdout, not the stderr tail the capture previously snapshotted. Incident files now include the last 200 stdout lines with the high-volume virtual_hid_keyboard_ready true heartbeat filtered out (1M+ occurrences in the live 175MB log). ... false lines are deliberately kept — they indicate driver disconnects.
  2. System load snapshot — the leading hypothesis is CPU-starvation-induced heartbeat misses, so each capture now records loadavg (via getloadavg) and the top 5 CPU consumers (via SubprocessRunner running ps with a 5s timeout).
  3. GUI-only limitation documented — the detector only runs while KeyPath.app is open; both 2026-06-10 incidents were missed because the GUI was closed. Documented on the service and in a new "2026-06-10 Incident Evidence" section of docs/bugs/MAL-57-duplicate-keypresses.md. Closing this gap (daemon-side detection) is left as future work.

Implementation notes

  • New logTail helper reads only the trailing 1MB of a log file via FileHandle seek — the previous stderr capture loaded the entire file into memory, which would have been 175MB+ for stdout. Truncated reads decode lossily and drop the partial first line, so a seek landing mid-multibyte-character can't degrade the whole tail to "(file not readable)".
  • ps runs through SubprocessRunner.shared (5s timeout) rather than raw Process(), bounding how long a slow ps — plausible under exactly the CPU starvation being diagnosed — can stall the snapshot that recovery awaits before restarting kanata. Process listing is skipped entirely under tests per the KeyPathTestCase rules.
  • Log paths come from KeyPathConstants.Logs instead of re-hardcoded strings.

Reviewer notes

  • The capture-then-restart ordering is unchanged: the snapshot is still awaited before the kanata restart. The new work adds bounded latency (two ≤1MB reads + a timeout-capped ps).
  • Code review gate: ran /code-review (medium) against the branch diff; the confirmed findings (shared path constants, SubprocessRunner reuse + timeout, UTF-8 boundary bug in the tail read) are addressed in the second commit.

Testing

  • 4 new unit tests: spam filtering + line limiting (incl. false-variant survival), missing file, truncated-read partial-line/UTF-8 regression, and test-environment ps skip. Full StuckKeyRecoveryServiceTests suite passes (10 tests, <0.3s).
  • Verified end-to-end on this machine: an incident file produced through the real snapshot path contains the load section and a spam-free stdout tail with the driver events visible.

malpern and others added 2 commits June 10, 2026 23:02
…UI-only gap (MAL-57)

The 2026-06-10 MAL-57 incidents exposed three gaps in the stuck-key
incident capture:

1. Captures snapshotted kanata stderr, but the pqrs driver-connection
   evidence (connected / driver connected: false / output backend
   unavailable / dropping KEY_* Release) lives in stdout. Add a stdout
   tail (last 200 lines) filtering the high-volume
   "virtual_hid_keyboard_ready true" heartbeat; "false" lines are kept
   since they indicate driver disconnects.

2. No system-load context despite CPU starvation being the leading
   hypothesis. Add loadavg (getloadavg) plus the top 5 CPU-consuming
   processes via ps; process listing is skipped under tests.

3. The detector only runs while KeyPath.app is open — both 2026-06-10
   incidents were missed because the GUI was closed. Documented on the
   service and in docs/bugs/MAL-57-duplicate-keypresses.md.

Log tails now read only the trailing 1MB of the file instead of loading
it whole (kanata stdout grows to 175MB+).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…og tail

- Use KeyPathConstants.Logs.kanataStdout/kanataStderr instead of
  re-hardcoding the daemon log paths.
- Run ps via SubprocessRunner with a 5s timeout instead of raw Process —
  bounds how long a slow ps (plausible under the CPU starvation being
  diagnosed) can stall the snapshot that recovery awaits before
  restarting kanata.
- logTail: decode lossily and drop the partial first line of a truncated
  read; previously a 1MB seek landing mid-multibyte-character made
  String(data:encoding:) fail and the whole tail degraded to
  "(file not readable)". Regression test added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0552a1161

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/KeyPathAppKit/Services/Monitoring/StuckKeyRecoveryService.swift Outdated
@malpern malpern enabled auto-merge (squash) June 11, 2026 18:55
…ations

Master independently shipped an overlapping capture (#897: tailOfFile,
loadavg/core count, raw ps) plus the test-sandbox redirect for the
incident directory (#901). Resolved to a single implementation keeping
the best of both:

- From master: AppPaths-based diagnosticsDirectory (test sandbox),
  active_cpu_count, loadavg_1m_5m_15m label, maxBytes test parameter.
- From this branch: vhid-ready spam filter (with "false" lines kept),
  1MB tail window, UTF-8-safe lossy decode + partial-first-line drop,
  SubprocessRunner-with-timeout for ps, pid column.
- New (Codex review): explicit truncation marker line when the tail
  window clips older output.

Doc: folded the capture-gap notes into the existing #897 telemetry
paragraph instead of keeping a duplicate same-named section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review — PR #907

Solid diagnostic improvement overall. The changes directly address the MAL-57 post-mortem gaps (stdout vs stderr, memory-safe tail read, ps latency bound). A few things worth flagging:


Correctness

logTail returns maxLines + 1 when truncated

The truncation marker is inserted after .suffix(maxLines), so logTail(path:, maxLines: 200) can return 201 lines when the tail window is truncated. Non-truncated calls return exactly maxLines. The tests capture this behavior correctly (testLogTailTruncatedReadDropsPartialFirstLine asserts 11, not 10), but callers may be surprised — and the capture file will have one extra line in the common-case where logs exceed 1 MB. Consider either:

  • Capping at maxLines by taking suffix(maxLines - 1) before inserting the marker, or
  • Documenting the off-by-one in the function signature comment so future callers know

Trailing empty line when log files end with \n

components(separatedBy: .newlines) on a \n-terminated file produces a trailing "". logTail doesn't strip it, so the last element of the returned array may be an empty string. Real kanata logs almost certainly end with \n. The test fixture uses joined(separator: "\n") without a trailing newline so the tests don't catch this. The original tailOfFile had the same issue — this PR carries it forward rather than introducing it — but it's worth fixing since we're touching the function anyway. A .filter { !$0.isEmpty } before .suffix(maxLines) would close it.


Style

await placement

await lines.append(contentsOf: topCPUProcessLines(limit: 5))

This compiles (the await covers the whole expression including the async argument), but the more idiomatic form makes clear which sub-expression is async:

lines.append(contentsOf: await topCPUProcessLines(limit: 5))

Minor, but the second form is what you'd expect to see in Swift concurrency code.


What's good

  • Lossy UTF-8 decode (String(decoding:as:)) correctly handles the mid-multibyte-character seek boundary that would have broken the whole chunk with String(data:encoding:). The test with "héartbéat" padding verifies this.
  • SubprocessRunner.shared + 5s timeout — exactly right. A hung ps under CPU starvation is precisely the condition being diagnosed; bounding it prevents the snapshot from stalling the recovery restart.
  • Filter-then-suffix ordering — filtering spam before .suffix(maxLines) is correct. Post-filter, maxLines useful lines are what the analyst actually sees.
  • Truncation marker — the "(… older output truncated…)" prefix is the right signal that filtered spam may have evicted earlier useful lines from the tail window.
  • Test coverage — the four new tests cover the edge cases that matter: spam filtering with false-variant survival, missing file, UTF-8 boundary on truncated read, and subprocess skip in test environment. All well-structured.
  • KeyPathConstants.Logs for log paths — no re-hardcoded strings.

Non-blocking

The two correctness items above are both defensive hardening on diagnostic capture code, not the hot path, so they don't block this. The trailing-empty-line issue in particular is pre-existing. Tagging as suggestions for the next pass if you agree with the direction.

Hardening from claude-review on #907: a \n-terminated log (the normal
case) produced an empty-string artifact as the tail's last element; all
empty lines are now dropped. The truncation marker is documented as
being on top of maxLines.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@malpern

malpern commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

Addressed the claude-review feedback in ba0bc38: logTail now drops empty lines (including the trailing-newline split artifact — fixture updated to use a \n-terminated file), and the truncation marker's +1 over maxLines is documented in the signature comment. Declined the await placement suggestion: the pinned SwiftFormat 0.61.1 config hoists await to statement position, so the suggested form gets rewritten back on every format.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review — PR #907

Overview

Solid, well-scoped incident-capture improvement. The 2026-06-10 MAL-57 incidents being useless because (a) the wrong log file was captured and (b) the GUI was closed is exactly the kind of post-incident hardening that pays off. The implementation is clean and the test coverage is thorough.


What's working well

  • Bounded 1MB tail over the previous full-file load is the right call for a 175MB log.
  • Lossy UTF-8 decode (String(decoding:as:UTF8.self)) + dropFirst() on truncated reads correctly handles the seek-landing-mid-multibyte-character case. This is the right fix.
  • SubprocessRunner.shared with timeout: 5 instead of raw Process() bounds the worst case under the very CPU starvation being diagnosed — a self-inflicted stall in the recovery path was a real risk.
  • KeyPathConstants.Logs paths eliminate the re-hardcoded strings the first pass had.
  • %.2f formatting for loadavg is a nice readability improvement over raw Double interpolation.
  • Adding pid to -Aceo pcpu,pid,comm makes process snapshots correlatable with other diagnostics.
  • 4 new unit tests covering spam filtering, partial-line drops, UTF-8 boundary, missing file, and test-environment ps skip — the truncated-read multibyte test in particular is exactly the kind of regression test that earns its weight.

Minor observations

1. await on a mutating append call (style)

In writeSnapshotToDisk and systemLoadLines:

await lines.append(contentsOf: systemLoadLines())
await lines.append(contentsOf: topCPUProcessLines(limit: 5))

This is valid Swift — await on an expression containing an async call works — but it's easy to misread as if append is async. The two-line form is unambiguously clearer:

let loadLines = await systemLoadLines()
lines.append(contentsOf: loadLines)

Low priority, but worth a nit if you're touching this file again.

2. Empty result when all lines are filtered (non-truncated read)

If the tail window fits under 1MB (so offset == 0) but every line in the file matches the spam filter, logTail returns [] with no marker. An incident file section that silently contains nothing is mildly confusing. The truncation marker path (offset > 0) correctly signals partial reads, but there's no equivalent signal for "filter removed everything." In practice this shouldn't happen (the stdout log always has non-heartbeat lines), so this is low priority — just worth being aware of.

3. topCPUProcessLines visibility

The old topCPUProcesses was private; the new topCPUProcessLines is internal (package-level) for testability. This is tested indirectly via systemLoadLines() in testSystemLoadLinesSkipProcessListingInTests, so the extra surface exposure isn't strictly needed. That said, direct testability of the test-environment guard is a reasonable trade-off. No action needed.

4. Truncation marker count guarantee

The docstring says the result "can be maxLines + 1 long." The test asserts tail.count == 11 (10 lines + marker). This is correct, and the testLogTailTruncatedReadDropsPartialFirstLine test validates it. Just confirming the off-by-one is intentional and documented — it is.


Summary

No blocking issues. The code is correct, the edge cases are handled, and the tests cover the important regressions. The await append(contentsOf: asyncFn()) style is the only thing worth a follow-up cleanup if convenient.

Verdict: Approve — ready to merge.

@malpern malpern merged commit a899aa5 into master Jun 11, 2026
5 checks passed
@malpern malpern deleted the claude/wizardly-kapitsa-d02efd branch June 11, 2026 19:22
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review - PR 907

Overview

Enriches StuckKeyRecoveryService incident captures with: a filtered stdout tail (1MB cap, heartbeat spam stripped), a bounded system-load snapshot (loadavg + ps via SubprocessRunner with timeout), and a documented GUI-only limitation. The PR correctly identifies why the 2026-06-10 captures were useless and directly addresses it. Implementation is clean and well-tested.


Strengths

  • Memory-safe tail read. logTail uses FileHandle.seek to read only the trailing 1MB. The previous tailOfFile had a 64KB default that was a latent footgun for 175MB+ stdout logs.
  • Lossy UTF-8 decode. Using String(decoding:as:) instead of String(data:encoding:) is exactly right. A seek landing mid-multibyte character would silently fail the strict decode and return (file not readable) for the entire tail. The partial first line is now correctly dropped with dropFirst().
  • Explicit truncation marker. The sentinel prepended at index 0 prevents a capture dominated by filtered spam from masquerading as complete history. Good discipline.
  • SubprocessRunner + timeout. Replacing raw Process() with a 5s-timeout SubprocessRunner.shared is the right call. A starved ps under exactly the CPU pressure being diagnosed could otherwise stall the recovery restart indefinitely.
  • pid added to ps output. -Aceo pcpu,pid,comm vs the previous -Aceo pcpu,comm is a low-cost improvement -- process name alone is ambiguous when multiple instances run.
  • Test environment guard encapsulated. Moving TestEnvironment.isRunningTests inside topCPUProcessLines rather than at the call site is cleaner and harder to accidentally bypass.
  • Test coverage. Four focused tests cover the core edge cases: spam filtering with false-variant survival, missing file, truncated-read partial-line/UTF-8, and test-environment skip. The truncation test is particularly thoughtful -- a greater-than 1MB fixture with multibyte characters that can straddle the seek boundary.

Issues

Minor: unusual async expression form

In writeSnapshotToDisk and systemLoadLines, both call sites use:

await lines.append(contentsOf: systemLoadLines())
await lines.append(contentsOf: topCPUProcessLines(limit: 5))

Valid Swift (the compiler propagates await into the async argument), but unusual enough to cause a double-take. The conventional form is clearer:

let loadLines = await systemLoadLines()
lines.append(contentsOf: loadLines)

Not a bug -- readability nit only.

Minor: truncation marker reports "1024KB" for the default 1MB window

With maxBytes = 1024 * 1024, the message reads "1024KB" rather than "1MB". Not wrong, but slightly confusing in incident files. Low priority.

Test nit: testLogTailFiltersSpamAndLimitsLines does not assert the truncation marker is absent

The fixture is ~20KB (under the 1MB cap), so no marker is prepended. But the test asserts tail.count == 200 without confirming the file was not truncated. If the fixture ever crosses 1MB, the count assertion breaks confusingly. Consider adding an assertion that tail.first does not contain "truncated".


Non-Issues

  • logTail returning up to maxLines + 1: Documented in the function comment; the truncation test correctly asserts 11 lines for maxLines: 10. Callers have no hard budget.
  • Case-sensitive substring filter: Fine for "virtual_hid_keyboard_ready true" -- kanata log format is stable.
  • Capture-then-restart ordering unchanged: Confirmed by PR description and diff. New async work is bounded (two <=1MB reads + timeout-capped ps).

Verdict

Approve with minor nits. Core logic is correct, memory-safety improvement is important, and test coverage is solid. The two cosmetic issues ("1024KB" vs "1MB", async expression style) can be addressed here or in a follow-on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant